3 Data Preparation#
import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np
# get custom color palette and colormap
from eda_helper import get_custom_palette, get_custom_colormap
# create the file paths for reading in data and for outputting figures and tables
DATA_PATH = "../data/saville_row_east_west/"
OUTPUT_TABLES_PATH = "../output/tables/4/"
OUTPUT_FIGURES_PATH = "../output/figures/4/"
os.makedirs(DATA_PATH, exist_ok=True)
os.makedirs(OUTPUT_TABLES_PATH, exist_ok=True)
os.makedirs(OUTPUT_FIGURES_PATH, exist_ok=True)
custom_palette = get_custom_palette()
custom_colormap = get_custom_colormap()
# read in the files for exploration
east_df = pd.read_pickle(os.path.join(DATA_PATH, "east_df.pkl"))
west_df = pd.read_pickle(os.path.join(DATA_PATH, "west_df.pkl"))
west_complete_days_df = pd.read_pickle(
os.path.join(DATA_PATH, "west_complete_days_df.pkl")
)
east_complete_days_df = pd.read_pickle(
os.path.join(DATA_PATH, "east_complete_days_df.pkl")
)
# group on dt to remove trajectories
east_complete_days_df = east_complete_days_df.groupby("dt").agg(
{"value": "sum", "date": "first"}
)
west_complete_days_df = west_complete_days_df.groupby("dt").agg(
{"value": "sum", "date": "first"}
)
3.1 Data Selection Rationale#
- Describe and justify the criteria and methods used to select the data from the initial dataset(s).
- Discuss the relevance of the selected data to the data mining goals defined earlier.
- Explain any assumptions made in this process and any implications for the final results.
- Include the reasons for including or excluding certain attributes.
For the time series, we will search for and then select a full week of data.
year = 2023
# Get unique week numbers in the data
unique_week_numbers = east_complete_days_df["date"].dt.isocalendar().week.unique()
# Loop through each week number
for week_number in unique_week_numbers:
    # Select data for the specific week and year
    ts_data = east_complete_days_df[
        (east_complete_days_df["date"].dt.isocalendar().week == week_number)
        & (east_complete_days_df["date"].dt.year == year)
    ]
    # Check if the length of the selected week matches the expected length
    # (4 readings per hour * 24 hours * 7 days = 672)
    is_complete = (4 * 24 * 7) == len(ts_data)
    # Print the week number if it is complete
    if is_complete:
        print(f"Week {week_number} is complete.")
        break
print(f"Week {week_number} data stored in `ts_data` with length {len(ts_data)}")
plt.plot(ts_data["value"])
plt.xticks(rotation=90)
plt.title(f"Week {week_number} of {year}")
timeseries = ts_data[["value"]].values.astype("float32")
Week 6 is complete.
Week 6 data stored in `ts_data` with length 672
3.2 Data Cleaning Report#
- Detail the process of cleaning the data, i.e. dealing with missing values, duplicate records, or inconsistent entries.
- Discuss the methods used for cleaning (e.g., deletion, imputation, etc.) and a justification of choices.
- Highlight any challenges faced during the data cleaning process.
- Provide before and after summaries to demonstrate the effects of our cleaning efforts.
3.3 Derived Attributes / Records#
- Discuss any new attributes or records that were derived from existing ones, such as creating new features through binning, combining attributes, or performing calculations.
- Explain why these transformations were done, and how these new attributes will contribute to the data mining goals.
- If applicable, describe any generation of synthetic data or aggregation of data at different levels.
We are going to add a number of new features derived from existing data: the periods of the strongest signals, calculated using the Lomb-Scargle periodogram. The purpose of this is to give the model context for the position of values in the time series. This is especially important as there are a large number of missing days now that all non-complete days have been removed.
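As a rough sketch of how such period features could be derived, the snippet below runs `scipy.signal.lombscargle` on a synthetic, unevenly sampled series with a known daily cycle, picks the strongest period, and encodes it as cyclical sin/cos features. The data and variable names here are illustrative only, not our actual pedestrian counts.

```python
import numpy as np
from scipy.signal import lombscargle

# Synthetic stand-in for an unevenly sampled count series (days, with gaps)
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 14, 500))
y = 10 + 5 * np.sin(2 * np.pi * t) + rng.normal(0, 1, t.size)  # daily cycle + noise

# Candidate periods in days; lombscargle expects angular frequencies
periods = np.linspace(0.2, 7, 2000)
ang_freqs = 2 * np.pi / periods
power = lombscargle(t, y - y.mean(), ang_freqs, normalize=True)

# Period of the strongest signal in the periodogram
best_period = periods[np.argmax(power)]

# Encode the detected period as cyclical features for the model
sin_feat = np.sin(2 * np.pi * t / best_period)
cos_feat = np.cos(2 * np.pi * t / best_period)
```

Because the periodogram works on the sample timestamps directly, it tolerates the gaps left by removing non-complete days, which is the main reason for preferring it over an FFT here.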
3.4 Merged Data#
- Describe any data that was merged or joined together from different sources or tables.
- Discuss the methods used to combine the data and any difficulties faced (like inconsistencies or discrepancies between the datasets).
- Explain how any issues were resolved (e.g., how discrepancies were reconciled).
- Detail how the merged data was validated to ensure correctness.
We are going to merge the term dates into the data as a boolean flag for use in the model.
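A minimal sketch of how that flag could be built is shown below. The term date ranges here are placeholders for illustration; the real ranges would come from the relevant calendar.

```python
import pandas as pd

# Hypothetical term windows (placeholder dates, illustration only)
term_dates = pd.DataFrame(
    {
        "start": pd.to_datetime(["2023-01-09", "2023-04-24"]),
        "end": pd.to_datetime(["2023-03-24", "2023-06-09"]),
    }
)

# A toy daily observation table standing in for the real data
df = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-12-31", freq="D")})

# Mark each observation True if it falls inside any term window
in_term = pd.Series(False, index=df.index)
for start, end in term_dates.itertuples(index=False):
    in_term |= df["date"].between(start, end)
df["is_term"] = in_term
```

Validating the merge is then a matter of spot-checking known in-term and out-of-term dates against the flag.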
3.5 Reformatted Data#
- Describe the process of reformatting the data for modeling or visualization purposes.
- Discuss any transformations of the data into different types (e.g., numerical to categorical, date-time conversions), creation of dummy variables, or re-scaling of features.
- Explain how and why any re-sampling procedures were done, if applicable.
- Discuss the format that the final dataset(s) will be in for the subsequent modeling or visualization stage.
First we need to scale the data to improve gradient descent convergence and to avoid vanishing/exploding gradients, both of which slow down model training. We will use sklearn.preprocessing.MinMaxScaler, as our data is discrete and follows a heavily right-skewed mixture distribution with a long tail and an exponential component, and it is important that we preserve the shape of the original distribution.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_timeseries = scaler.fit_transform(timeseries)
Model 0: Baseline#
This model takes the current value as input and returns it as the prediction for the next time step, effectively predicting “no change”. As the model can only predict a single time step into the future, the prediction will not change as the window length increases. We can, however, add an option to increase the number of time steps the model predicts and see how performance changes as that number grows. For single time steps the baseline should be a reasonable estimate, as the number of pedestrians generally changes slowly throughout the day. First we need to change our pandas.core.series.Series object into a numpy.ndarray object. We are dealing with a one-dimensional array whose values form a time vector, i.e. an ordered list.
We then want to split the data into windows. In our first test run, we use a week of continuous data and split it into windows with WINDOW_SIZE=1 and HORIZON=1. The baseline model will take the input and return that value as the output. We will then calculate some scores to see how far the predictions are from the labels.
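The `sliding_windows` helper imported from `eda_helper` is not reproduced here; as a minimal numpy sketch of the idea (with a hypothetical `make_windows` name), each input window of `window_size` steps is paired with the following `horizon` steps as its target, advancing by `stride` steps each time:

```python
import numpy as np


def make_windows(series, window_size, horizon, stride=1):
    """Pair each window of `window_size` steps with the next `horizon` steps."""
    inputs, targets = [], []
    for start in range(0, len(series) - window_size - horizon + 1, stride):
        inputs.append(series[start : start + window_size])
        targets.append(series[start + window_size : start + window_size + horizon])
    return np.array(inputs), np.array(targets)


series = np.arange(10, dtype="float32")
X, y = make_windows(series, window_size=3, horizon=1)
# X[0] is [0, 1, 2] and its target y[0] is [3]
```

The actual helper additionally returns torch tensors and prints the per-window shapes seen in the outputs below, but the windowing logic is the same in spirit.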
from IPython import display
import os
display.Image(
    # use os.path.join rather than a backslash literal so the path is portable
    os.path.join(os.getcwd(), "graphics", "sliding_window_w3h3s1.png"),
    width=768,
    height=768,
)
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import matplotlib.animation as animation
import IPython.display as ipy
from IPython.display import HTML
from eda_helper import sliding_windows, plot_windows
WINDOW_SIZE = 4
HORIZON = 4
STRIDE = 4
inputs, targets = sliding_windows(scaled_timeseries, WINDOW_SIZE, HORIZON, STRIDE)
print(
f"Inputs array shape: {inputs.squeeze().shape} | Target array shape: {targets.squeeze().shape}"
)
print(f"Inputs type: {type(inputs)} | Targets type: {type(targets)}")
class Model0(nn.Module):
    def __init__(self, horizon):
        super().__init__()
        self.horizon = horizon

    def forward(self, x):
        # Get the last element from each input sequence
        last_values = x[:, -1:]
        # Repeat this value to match the horizon length
        return last_values.expand(-1, self.horizon)


baseline = Model0(HORIZON)
predictions = baseline(inputs.squeeze())
plot_windows(inputs, predictions, targets, num_plots=2, step=1)
Input shape: (4, 1) | Target shape: (4, 1)
Inputs array shape: torch.Size([167, 4]) | Target array shape: torch.Size([167, 4])
Inputs type: <class 'torch.Tensor'> | Targets type: <class 'torch.Tensor'>
plt.ioff()  # Turn off interactive plotting
fig, ax = plt.subplots(1, 1, figsize=(6, 4))


def animate(i):
    ax.clear()
    ax.plot(
        range(len(inputs[i])),
        inputs[i],
        label="Inputs",
        color=custom_palette[0],
    )
    ax.scatter(
        range(len(inputs[i]), len(inputs[i]) + len(predictions[i])),
        predictions[i],
        label="Predictions",
        color=custom_palette[1],
        marker="x",
        s=60,
    )
    ax.scatter(
        range(len(inputs[i]), len(inputs[i]) + len(targets[i])),
        targets[i],
        label="Targets",
        color=custom_palette[0],
    )
    ax.legend()
    ax.set_title(f"Window {i+1}")
    ax.set_ylim(0, inputs.max())
# Assign the FuncAnimation to a variable
ani = animation.FuncAnimation(fig, animate, frames=len(inputs[:96]), repeat=True)
# Create a display object
display_obj = HTML(ani.to_jshtml())
# Display the object
ipy.display(display_obj)
Model 0 Performance#
targets.squeeze().flatten().shape, predictions.flatten().shape
(torch.Size([668]), torch.Size([668]))
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Flattening predictions and targets for metric calculations
predictions_flat_np = predictions.flatten().detach().numpy()
targets_flat_np = targets.flatten().detach().numpy()
# Mean Absolute Error (MAE)
mae = mean_absolute_error(targets_flat_np, predictions_flat_np)
print(f"Mean Absolute Error (MAE): {mae}")
# Mean Squared Error (MSE)
mse = mean_squared_error(targets_flat_np, predictions_flat_np)
print(f"Mean Squared Error (MSE): {mse}")
# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")
# Symmetric Mean Absolute Percentage Error (SMAPE)
epsilon = 1e-10 # stops divide by zero error
smape = (
np.mean(
2.0
* np.abs(targets_flat_np - predictions_flat_np)
/ (np.abs(targets_flat_np) + np.abs(predictions_flat_np) + epsilon)
)
* 100
)
print(f"Symmetric Mean Absolute Percentage Error (SMAPE): {smape}%")
# R-squared (R²)
r2 = r2_score(targets_flat_np, predictions_flat_np)
print(f"R-squared (R²): {r2}")
# Create a dictionary with the metrics
metrics = {
"Model": "Baseline",
"MAE": mae,
"MSE": mse,
"RMSE": rmse,
"SMAPE": smape,
"R^2": r2,
}
# Store the metrics in a DataFrame so later models can be appended for comparison
performance_df = pd.DataFrame([metrics], columns=["Model", "MAE", "MSE", "RMSE", "SMAPE", "R^2"])
# Print the DataFrame
print(performance_df.round(3))
Mean Absolute Error (MAE): 0.05719786882400513
Mean Squared Error (MSE): 0.0077348146587610245
Root Mean Squared Error (RMSE): 0.0879477933049202
Symmetric Mean Absolute Percentage Error (SMAPE): 42.62365400791168%
R-squared (R²): 0.8907532938409718
Model MAE MSE RMSE SMAPE R^2
0 Baseline 0.057 0.008 0.088 42.624 0.891
Model 1: Simple LSTM#
For our first LSTM model, we will use a single LSTM layer and a linear output layer. The model can only produce a single time-step output, so how far into the future we predict depends on the frequency of the data points we feed into it from our generated windows. To begin with, we attempt to predict 15 minutes into the future, as this is the highest frequency of measurements we have for training. Later on we may aggregate to hourly or daily measurements to allow us to predict further into the future.
train_size = int(len(scaled_timeseries) * 0.67)
test_size = len(scaled_timeseries) - train_size
train, test = scaled_timeseries[:train_size], scaled_timeseries[train_size:]
WINDOW_SIZE = 40
HORIZON = 1
X_train, y_train = sliding_windows(train, WINDOW_SIZE, HORIZON)
X_test, y_test = sliding_windows(test, WINDOW_SIZE, HORIZON)
print()
print(f"Train input tensor: {X_train.shape} | Train target tensor: {y_train.shape}")
print()
print(f"Test input tensor: {X_test.shape} | Test target tensor: {y_test.shape}")
Input shape: (40, 1) | Target shape: (1, 1)
Input shape: (40, 1) | Target shape: (1, 1)
Train input tensor: torch.Size([410, 40, 1]) | Train target tensor: torch.Size([410, 1, 1])
Test input tensor: torch.Size([182, 40, 1]) | Test target tensor: torch.Size([182, 1, 1])
class LSTMModel1(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=1, hidden_size=50, num_layers=1, batch_first=True
        )
        self.linear = nn.Linear(50, 1)

    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.linear(x)
        return x[:, -1, :]  # Selecting the last output of the sequence
from matplotlib import animation
from IPython.display import HTML
import matplotlib.pyplot as plt
import matplotlib
# A function to train the model and record the training data.
# window_size, epochs and device are required arguments here because they
# are defined in a later cell, so they cannot be used as default values.
def train_model_and_record_data(
    X_train,
    y_train,
    X_test,
    y_test,
    window_size,
    epochs,
    device,
):
    LSTM_v1 = LSTMModel1().to(device)
    # Move the tensors to the same device as the model
    X_train, y_train = X_train.to(device), y_train.to(device)
    X_test, y_test = X_test.to(device), y_test.to(device)
    optimiser = optim.Adam(LSTM_v1.parameters())
    loss_fn = nn.MSELoss()
    loader = data.DataLoader(
        data.TensorDataset(X_train, y_train.squeeze(-1)), shuffle=True, batch_size=8
    )
    training_data_per_window = []
    for epoch in range(epochs):
        LSTM_v1.train()
        for X_batch, y_batch in loader:
            y_pred = LSTM_v1(X_batch)
            loss = loss_fn(y_pred, y_batch)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
        LSTM_v1.eval()
        with torch.inference_mode():
            y_pred = LSTM_v1(X_train)
            train_rmse = torch.sqrt(loss_fn(y_pred, y_train.squeeze(-1))).item()
            test_pred = LSTM_v1(X_test)
            test_rmse = torch.sqrt(loss_fn(test_pred, y_test.squeeze(-1))).item()
            # Build NaN-padded series aligned with the full timeseries for plotting
            train_plot = np.ones_like(timeseries) * np.nan
            train_plot[window_size:train_size] = np.expand_dims(
                y_pred[:, -1].cpu().numpy(), axis=1
            )
            test_plot = np.ones_like(timeseries) * np.nan
            test_plot[train_size + window_size : len(timeseries)] = np.expand_dims(
                test_pred[:, -1].cpu().numpy(), axis=1
            )
        # Add the data for this epoch to the list
        training_data_per_window.append(
            (epoch, train_plot, test_plot, train_rmse, test_rmse)
        )
        if (epoch + 1) % 10 != 0:
            continue
        print(
            f"Window Size: {window_size} | Epoch: {epoch+1} | Train RMSE: {train_rmse:.4f} | Test RMSE: {test_rmse:.4f}"
        )
    return training_data_per_window
# Main code
training_data = []
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)
EPOCHS = 20
WINDOW_SIZES = [2, 3, 4, 6, 8, 10, 15, 20]
for window_size in WINDOW_SIZES:
    X_train, y_train = sliding_windows(train, window_size, HORIZON)
    X_test, y_test = sliding_windows(test, window_size, HORIZON)
    training_data_per_window = train_model_and_record_data(
        X_train, y_train, X_test, y_test, window_size, EPOCHS, device
    )
    # Add the data for this window to the main training_data list
    # (loop variable named `record` to avoid shadowing torch.utils.data as data)
    training_data.extend(
        [(window_size,) + record for record in training_data_per_window]
    )
# Set the animation embed limit
matplotlib.rcParams["animation.embed_limit"] = 2**100
# Create the initial plot
fig, ax = plt.subplots()
x_coords = np.arange(len(scaled_timeseries))
(line1,) = ax.plot(x_coords, scaled_timeseries, c=custom_palette[0])
(line2,) = ax.plot(
x_coords, np.full_like(scaled_timeseries, np.nan), c=custom_palette[1]
)
(line3,) = ax.plot(
x_coords, np.full_like(scaled_timeseries, np.nan), c=custom_palette[2]
)
# Animation update function
def animate(i):
    line2.set_ydata(training_data[i][2])
    line3.set_ydata(training_data[i][3])
    ax.set_title(
        f"Window Size: {training_data[i][0]} | Epoch: {training_data[i][1]+1} | Train RMSE: {training_data[i][4]:.4f} | Test RMSE: {training_data[i][5]:.4f}"
    )
    return (
        line1,
        line2,
        line3,
    )
# Create the animation
ani = animation.FuncAnimation(
fig, animate, frames=len(training_data), interval=100, blit=True, repeat=True
)
# Display the animation
HTML(ani.to_jshtml())
Input shape: (2, 1) | Target shape: (1, 1)
Input shape: (2, 1) | Target shape: (1, 1)
Input shape: (3, 1) | Target shape: (1, 1)
Input shape: (3, 1) | Target shape: (1, 1)
Input shape: (4, 1) | Target shape: (1, 1)
Input shape: (4, 1) | Target shape: (1, 1)
Input shape: (6, 1) | Target shape: (1, 1)
Input shape: (6, 1) | Target shape: (1, 1)
Input shape: (8, 1) | Target shape: (1, 1)
Input shape: (8, 1) | Target shape: (1, 1)
Input shape: (10, 1) | Target shape: (1, 1)
Input shape: (10, 1) | Target shape: (1, 1)
Input shape: (15, 1) | Target shape: (1, 1)
Input shape: (15, 1) | Target shape: (1, 1)
Input shape: (20, 1) | Target shape: (1, 1)
Input shape: (20, 1) | Target shape: (1, 1)